Production data
Graphical exploration
Production units all follow similar day-night and seasonal cycles
This is reasonable for solar power.
Some units show obvious examples of production being added
Normalizing to installed capacity
The installed capacity on any given day is an upper limit on production.
When new capacity is installed between the date that we are predicting and the date when the data is measured, this throws off the normalization.
This will be a problem for prediction, but not for training. Correcting it would require predicting when new capacity will be installed.
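The normalization and the way stale capacity distorts it can be illustrated with a minimal sketch. The function name, array names, and values below are hypothetical and are not taken from the notebook's data; the sketch only assumes that production and capacity are sampled at the same time steps.

```python
import numpy as np

def normalize_to_capacity(production, capacity):
    """Return production as a fraction of the capacity installed at the same time step."""
    production = np.asarray(production, dtype=float)
    capacity = np.asarray(capacity, dtype=float)
    if production.shape != capacity.shape:
        raise ValueError("production and capacity must have the same shape")
    # Leave entries with zero or missing capacity undefined instead of dividing by zero.
    fraction = np.full(production.shape, np.nan)
    valid = capacity > 0
    fraction[valid] = production[valid] / capacity[valid]
    return fraction

# Hypothetical values: capacity doubles from 10 to 20 halfway through the series.
production = np.array([0.0, 5.0, 8.0, 10.0, 16.0, 0.0])
capacity_measured = np.array([10.0, 10.0, 10.0, 20.0, 20.0, 20.0])
# Capacity as known at prediction time, before the new installation.
capacity_stale = np.full(6, 10.0)

fraction_ok = normalize_to_capacity(production, capacity_measured)
fraction_stale = normalize_to_capacity(production, capacity_stale)
```

With the stale capacity, the normalized values after the installation exceed 1, which violates the upper limit that the normalization relies on.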
Direction: These changes in production offer an opportunity
We can test how good the model of production as a fraction of capacity is. In windows centered on each capacity change, there should be no statistical difference between production as a fraction of installed capacity before and after the new installation.
First, however, we need to know the distribution of production values.
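One possible way to run such a before/after comparison is a permutation test on the difference in means. This is a sketch of a swapped-in technique, not the notebook's method: the function name and the synthetic samples are hypothetical, and the test treats samples as exchangeable, which hourly production values are not (they are autocorrelated), so in practice it may be safer to compare daily aggregates.

```python
import numpy as np

def permutation_test_mean_diff(before, after, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means between two samples."""
    rng = np.random.default_rng(seed)
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    observed = abs(after.mean() - before.mean())
    pooled = np.concatenate([before, after])
    n_before = before.size
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[n_before:].mean() - shuffled[:n_before].mean())
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)

# Synthetic capacity-normalized samples drawn from the same distribution,
# standing in for windows before and after a capacity change.
rng = np.random.default_rng(1)
before = rng.beta(2.0, 5.0, size=60)
after = rng.beta(2.0, 5.0, size=60)
p_value = permutation_test_mean_diff(before, after)
```

A small p-value around a known capacity change would suggest that the normalization does not fully absorb the change.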
Histograms of capacity-normalized data
The data is dominated by small values due to nights and winters.
Winter and weather mean that small values dominate, even at noon.
If we limit ourselves to the summer (June-August) at noon, we see a very different distribution of production values. This time it is flatter, and peaks at a large value.
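The summer-noon selection can be sketched as follows, assuming hourly timestamps. The helper name is hypothetical, and `fraction` is random placeholder data standing in for the capacity-normalized production, so the resulting histogram says nothing about the real distribution.

```python
import numpy as np

def summer_noon_mask(timestamps):
    """Boolean mask selecting June-August samples taken at 12:00."""
    ts = np.asarray(timestamps, dtype="datetime64[h]")
    # Month index 1..12: months since the epoch modulo 12, plus one.
    month = ts.astype("datetime64[M]").astype("int64") % 12 + 1
    # Hour of day 0..23: hours elapsed since the start of the day.
    hour = (ts - ts.astype("datetime64[D]")).astype("int64")
    return (month >= 6) & (month <= 8) & (hour == 12)

# Hypothetical hourly series covering one year; the values are random, not real data.
timestamps = np.arange("2020-01-01T00", "2021-01-01T00", dtype="datetime64[h]")
rng = np.random.default_rng(0)
fraction = rng.uniform(0.0, 1.0, size=timestamps.size)

mask = summer_noon_mask(timestamps)
counts, edges = np.histogram(fraction[mask], bins=20, range=(0.0, 1.0))
```

The same mask can be reused to compare the full-data histogram against the restricted one on identical bin edges.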
This represents a challenge for modelling the data:
The impact of the noise (i.e. weather) depends on the context of other seasonal features. Any model that doesn't take this into account will fail to properly model the noise.
- Direction: One option to address this is to run a regression against the parameters of a beta distribution. This would capture the distribution of outputs that we expect, and could be useful for future models.
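As a first step toward that direction, the beta parameters for a single context (for example one month and hour) could be estimated by the method of moments and then used as regression targets. This is only a sketch under that assumption: the function name is hypothetical, the samples are synthetic, and the regression itself is not shown.

```python
import numpy as np

def beta_method_of_moments(values):
    """Estimate beta(a, b) parameters from samples in (0, 1) by matching mean and variance."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    var = x.var(ddof=1)
    # The method of moments requires 0 < var < mean * (1 - mean).
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("sample variance is incompatible with a beta distribution")
    # For a beta distribution, a + b = mean * (1 - mean) / var - 1.
    total = mean * (1.0 - mean) / var - 1.0
    return mean * total, (1.0 - mean) * total

# Synthetic samples standing in for capacity-normalized production in one context.
rng = np.random.default_rng(0)
samples = rng.beta(2.0, 5.0, size=20000)
a_hat, b_hat = beta_method_of_moments(samples)
```

Exact zeros (nights) fall outside the open interval (0, 1), so they would need separate handling before any beta-based fit.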
Regression against a single production unit
Linear model
Using joint time x year features.
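The notebook does not show how the time x year features are built. One reading, assumed here, is a one-hot encoding of each (hour of day, year) pair, fitted by ordinary least squares. The function name and the toy values are hypothetical.

```python
import numpy as np

def interaction_one_hot(hour, year):
    """One-hot design matrix with one column per observed (hour, year) pair."""
    pairs = np.stack([np.asarray(hour), np.asarray(year)], axis=1)
    _, column = np.unique(pairs, axis=0, return_inverse=True)
    column = column.ravel()  # the inverse index shape differs across NumPy versions
    design = np.zeros((pairs.shape[0], column.max() + 1))
    design[np.arange(pairs.shape[0]), column] = 1.0
    return design

# Toy data: capacity-normalized production at midnight and noon over two years.
hour = np.array([0, 0, 12, 12, 0, 12])
year = np.array([2019, 2019, 2019, 2019, 2020, 2020])
y = np.array([0.0, 0.0, 0.6, 0.8, 0.0, 0.5])

design = interaction_one_hot(hour, year)
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
residuals = y - fitted
```

With a pure one-hot design like this, each fitted value is simply the mean of its (hour, year) group, so the model cannot represent weather-driven variation within a group.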
The fits don't look OK.
- The residual plot shows strong heteroskedasticity, as we expect from the histogram analysis.
- The error shows a time (magnitude) dependency.
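A rough numerical counterpart to the residual plot is to compare the spread of the residuals across bins of the fitted value. This is a sketch with a hypothetical helper and synthetic residuals whose noise grows with the fitted value; it is not a formal test.

```python
import numpy as np

def residual_spread_by_bin(fitted, residuals, n_bins=5):
    """Standard deviation of residuals within equal-count bins of the fitted values."""
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Sort residuals by fitted value, then split them into bins of near-equal size.
    order = np.argsort(fitted)
    groups = np.array_split(residuals[order], n_bins)
    return np.array([group.std() for group in groups])

# Synthetic example: the noise scale increases with the fitted value.
rng = np.random.default_rng(0)
fitted = np.linspace(0.0, 1.0, 1000)
residuals = rng.normal(0.0, 0.01 + fitted)
spread = residual_spread_by_bin(fitted, residuals)
```

Under homoskedastic errors the entries of `spread` would be roughly equal; a clear trend across bins points to heteroskedasticity.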
Linear model of fold changes
Linear vs logarithmic doesn't really matter. The differences in the models are minuscule.
Note: the heteroskedasticity is still present.
Model with autocorrelation
The above residuals show strong autocorrelation.
Removing the 48-hour autocorrelation has a fairly negligible impact.
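The autocorrelation check and the removal of a single lag can be sketched as below. The sketch assumes hourly residuals, so that a 48-hour lag corresponds to 48 samples, and it uses a simple one-lag least-squares correction that may differ from what the notebook does. The function names and the sinusoidal stand-in series are hypothetical.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive integer lag."""
    x = np.asarray(series, dtype=float)
    if not 0 < lag < x.size:
        raise ValueError("lag must be between 1 and len(series) - 1")
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def remove_lag(series, lag):
    """Subtract from each value its least-squares prediction from the value `lag` steps earlier."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    past, present = x[:-lag], x[lag:]
    slope = np.dot(past, present) / np.dot(past, past)
    return present - slope * past

# Stand-in for hourly residuals: a pure daily cycle over 20 days.
t = np.arange(24 * 20)
series = np.sin(2 * np.pi * t / 24)

rho_48 = autocorrelation(series, 48)
corrected = remove_lag(series, 48)
```

On real residuals, comparing the error before and after the correction shows how much of it the chosen lag actually explains.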
Overall, the linear model seems quite poorly suited to this data.
It may still be helpful as a starting point for models.